Skip to content

Conversation

@wbruna
Copy link

@wbruna wbruna commented Aug 8, 2025

This is a straight implementation that adds controls and command-line switches for the flash_attention, diffusion_conv_direct and vae_conv_direct flags.

For Conv2D Direct, see leejet/stable-diffusion.cpp#744 for details. But in a nutshell: VAE decoding can be ~2.4-3.8x faster, with less than a half memory usage (allowing 1024x1024 generations without VAE tiling). The diffusion process currently doesn't seem to benefit from it, so the main reason to enable the control would be exposing the functionality to more testing.

The main drawback is a possible crash for unsupported backends. We could in principle only honor the flag for known to be working configurations, or blacklist broken ones. On my tests, it's currently working on CPU and Vulkan, while ROCm crashes.

The separate flash attention control is useful to avoid certain bugs, like leejet/stable-diffusion.cpp#756 , and configurations where flash attention benefits text generation, but slows down image generation.

Depends on #1669 .

@wbruna wbruna force-pushed the kcpp_sd_perf_flags branch 7 times, most recently from 07d0e92 to 941e502 Compare August 12, 2025 15:58
@wbruna wbruna marked this pull request as ready for review August 12, 2025 15:58
@LostRuins
Copy link
Owner

In what scenarios does it break (is it model dependent) or is it purely backend dependent? Also, are there any drawbacks to it quality/perf wise?

If it's an all round improvement, rather than overcomplicate with too many flags, would it be better to just force-enable it for supported backends? or is it conditionally broken?

@wbruna
Copy link
Author

wbruna commented Aug 13, 2025

It directly depends on the implementation of a specific operation, so it's unlikely to crash on a specific model (except in the sense that a model family which doesn't use that operation would of course always work, but I think all currently do).

In terms of performance, it's a bit too early to be sure. When it helps, it helps a lot, but it's not clear if it could get worse for some backend+model+hardware combinations. Quality seems to be the same, although outputs may vary a bit; in my tests, the VAE output changes less than it does with tiling.

Another possibility could be enabling the flags unconditionally for known-to-be-working cases, but including an environment variable to work around possible issues (either forcing it on to test a new backend, or off to avoid a special case).

@wbruna
Copy link
Author

wbruna commented Aug 13, 2025

it's not clear if it could get worse for some backend+model+hardware combinations

Already found an example - my own card! VAE timings for rendering a 512x512 SDXL image, radv versus amdvlk:

radv amdvlk
base 3.54s 3.74s
vae-conv-direct 0.86s 11.46s

It's even more dramatic for diffusion-conv-direct : ~20% worse on radv, 5x worse on amdvlk.

@LostRuins LostRuins force-pushed the concedo_experimental branch from 1977086 to c6f7603 Compare August 14, 2025 16:16
@LostRuins
Copy link
Owner

Alright how about this:

  • we can add the extra new toggle for SD flash attention
  • we add a new toggle for conv direct, but we combine both of them into a single flag for simplicity. The flag toggles them both. How about --sdconvdirect (avoid 'dir' as that usually means directory here)

@wbruna wbruna force-pushed the kcpp_sd_perf_flags branch from 941e502 to 15a4f9f Compare August 16, 2025 14:51
@wbruna
Copy link
Author

wbruna commented Aug 16, 2025

we add a new toggle for conv direct, but we combine both of them into a single flag for simplicity. The flag toggles them both. How about --sdconvdirect (avoid 'dir' as that usually means directory here)

A single boolean option would be a bit annoying, to be honest :-) My card gets 20% slower for the diffusion step, so it could be slower overall for high step counts, even if the VAE is much faster.

But a non-boolean sdconvdirect would be fine. I've implemented one with three values: disabled, enabled, and 'vaeonly' to enable it only for the VAE (I don't think we'll find a case where only the the VAE would be slower, but it's easy to add an extra value for that if you wish).

@wbruna wbruna force-pushed the kcpp_sd_perf_flags branch from 15a4f9f to d485089 Compare August 17, 2025 01:01
@LostRuins
Copy link
Owner

Anyway let me know when its ready to review

@wbruna wbruna force-pushed the kcpp_sd_perf_flags branch from d485089 to e69878c Compare August 17, 2025 03:28
@wbruna
Copy link
Author

wbruna commented Aug 17, 2025

Should be good to go now... except maybe for the combobox control. I tried to follow the other labeled controls as examples; it works, but you'll probably want to adjust its position and padding. It's being called only at one point right now, but I may make use of it in another (unrelated) PR.

@LostRuins
Copy link
Owner

You added sd_diffusion_convdir_var but it is not used anywhere

@LostRuins LostRuins added the enhancement New feature or request label Aug 19, 2025
@wbruna wbruna force-pushed the kcpp_sd_perf_flags branch from 47326af to df2a43f Compare August 19, 2025 09:47
@wbruna
Copy link
Author

wbruna commented Aug 19, 2025

Cleaned up that var, and rebased on top of concedo_experimental.

Edit: while working on #1692 , I started to think on and off could be better than enabled and disabled; or maybe off, vaeonly and full. I don't really have a strong preference, but maybe it'd be best to choose values that could be the same for other options, for consistency?

@LostRuins
Copy link
Owner

I think the disambiguation is good. So long as the commands are clear it's good. Generally I try to avoid enabled/on/off/disabled flags because that is implied with store true, i.e. --usemmap means mmap is enabled. So if 3 options are needed I think keeping it explicit is ok

@wbruna wbruna force-pushed the kcpp_sd_perf_flags branch from df2a43f to ab2f092 Compare August 19, 2025 21:55
Copy link
Owner

@LostRuins LostRuins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be good, merging

@LostRuins LostRuins merged commit 6003e90 into LostRuins:concedo_experimental Aug 20, 2025
@wbruna wbruna deleted the kcpp_sd_perf_flags branch August 22, 2025 16:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants